
Use multi-stage build for processing dependencies #420

Open
eyeseast wants to merge 1 commit into master from fix-local-reprocessing-redaction


@eyeseast

Contributor

This fixes a problem I was having locally where OCR and redaction failed because of missing binaries. More here: https://gist.github.com/eyeseast/434734582ae07ee5845abe968f7fc108

We now use a multi-stage build to compile dependencies in Ubuntu and then copy them into our python:3.12-slim image, so the processing scripts can find them.
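For readers unfamiliar with the pattern, here is a minimal sketch of what a multi-stage build like this could look like. It is not the Dockerfile from this PR: the stage name `deps`, the use of `apt-get` instead of compiling from source, and the copied paths are all assumptions for illustration.

```dockerfile
# Stage 1 (hypothetical stage name "deps"): obtain the native OCR tooling on Ubuntu.
# A real build might compile from source here instead of installing a package.
FROM ubuntu:20.04 AS deps
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update \
    && apt-get install -y --no-install-recommends tesseract-ocr \
    && rm -rf /var/lib/apt/lists/*

# Stage 2: the slim Python image that the processing scripts run in.
FROM python:3.12-slim

# Copy the binary out of the first stage (paths are assumptions).
# The shared libraries it links against would need to be copied as well,
# otherwise the binary will not start in the final image.
COPY --from=deps /usr/bin/tesseract /usr/local/bin/tesseract
```

The point of the pattern is that only the files named in `COPY --from=...` reach the final image, so the build tooling in the first stage does not bloat it.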

This only affects local development, not production.

@duckduckgrayduck

duckduckgrayduck commented Jul 1, 2026

Contributor

I'd like to avoid patching in dependencies whenever possible. From what I understand, tesseract running locally relies on certain dependencies only available in Ubuntu 20.04, from the last time the local code was compiled. I think the better move here, and one that actually matches what production does, is to make tesseract self-contained when running locally. I have a branch that does this here:
https://github.com/MuckRock/documentcloud/tree/local_ocr_fix
I feel that this is better than patching the local.yml file, as it is more explicit about exactly which dependencies are needed and where they live, and it makes tesseract more self-contained. I have verified that OCR works locally with this branch.

@mitchelljkotler

Member

I think these two approaches are actually similar: one does a multi-stage Docker build to grab the necessary libraries from the Ubuntu 20.04 image, while the other just checks those libraries directly into git.


4 participants